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Abstract 

The study of high energy cosmic rays requires detailed Monte Carlo simulations of 
both, extensive air showers and the detectors involved in their detection. In par- 
ticular, the energy calibration of several experiments is obtained from simulations. 
Also, in composition studies simulations play a fundamental role because the pri- 
mary mass is determined by comparing experimental with simulated data. At the 
highest energies the detailed simulation of air showers is very costly in processing 
time and disk space due to the large number of secondary particles generated in 
interactions with the atmosphere. Therefore, in order to increase the statistics, it 
is quite common to recycle single showers many times to simulate the detector re- 
sponse. As a result, the events of the Monte Carlo samples generated in this way 
are not fully independent. In this work we study the artificial effects introduced by 
the multiple use of single air showers for the detector simulations. In particular, 
we study in detail the effects introduced by the repetitions in the kernel density 
estimators which are frequently used in composition studies. 

Key words: Cosmic Rays, Air Showers, Detector Simulations 



1 Introduction 



The spectrum of cosmic rays extends over more than eleven orders of magni- 
tude, starting at i^^ ~ 10^ eV up to energies above 10^° eV. Above ~ 10^^ eV 
they are too infrequently to be detected by balloons or spacecraft. Therefore, 
the detection techniques used in this energy range are based in the properties 
of the extensive air showers produced by them in the atmosphere. There are 
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essentially two techniques for shower detection [1]: (i) arrays of surface de- 
tectors which measure the lateral distribution of the secondary particles that 
reach the Earth surface and (ii) measurements of the fluorescence light emit- 
ted by atmospheric nitrogen excited by charged particles of the shower as they 
traverse the atmosphere. 

Air shower and detector simulations play a fundamental role in the study 
of cosmic rays. In particular, arrays of surface detectors that do not have 
fluorescence telescopes to calibrate the energy scale, must resort to simulated 
data in order to estimate the energy of the primary particle. Furthermore, the 
primary mass is also obtained comparing experimental data with simulations. 

There are several Monte Carlo programs for air shower simulation, the most 
used in the literature are AIRES [2], CORSIKA [3], and CONEX [4], the latter 
for a fast simulation of the longitudinal shower development. Since the number 
of particles produced in a shower can be extremely large, e.g., ~ 10^^ for a 10^'^ 
eV proton shower, the computer processing time and disk space needed are 
also very large, even if unthinning methods [5,6] are used. Due to this difficulty 
it is a common practice to reuse the same shower for generating several events 
(see for example [7,8,9,10,11,12]). This practice is more common in simulations 
that includes surface detectors because, for fluorescence telescopes, very fast 
Monte Carlo programs like CONEX, introduced few years ago, have very fast 
and efficient algorithms for the generation of longitudinal profiles. 

In this work we study the effects of using multiple repetitions of individual 
showers, applied to the simulation of detectors, on the evaluation of stan- 
dard estimators of the expected value, variance, and covariance as well as on 
histograms corresponding to observable parameters. We study in detail the 
effects introduced in the kernel density estimators, which are analytical esti- 
mates of the underlying distribution function obtained from a finite sample of 
events. In cosmic rays physics this technique is used mainly in connection with 
composition analyses [13,14,15,16,17,18]; however, it is also extensively used 
in many different areas of knowledge [19] to which this work can be directly 
extended. 

As a numerical example, we discuss the effects of repetitions on samples of 
the Xjnax parameter, the atmospheric depth at which an air shower reach its 
maximum development, obtained with the package CONEX. 



2 Analytical Treatment 

As mentioned in the introduction, we want to study the potential distortions 
introduced by reusing individual showers to maximize the statistics when sim- 
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ulating the response of a detector. Let us start with the optimum case in which 
each individual shower is used only once and, therefore, best reproduces real- 
ity. 



Let y be a d-dimensional vector composed by physical observables (e.g. mass 
sensitive parameters) distributed as g{y) and let z be a random vector, dis- 
tributed as h{z), that takes into account the effects of the detectors and the 
corresponding reconstruction method such that, after measuring and recon- 
structing the empirical information, a vector x = y -|- z is obtained. The 
distribution function of x is the convolution of g{y) and /i(z). 



Suppose that we have a sample of N independent events of the distribution / 
The probability of this configuration can be written as. 



However, as previously noted, if single showers are recycled and used many 
times to simulate the response of the detectors, non-independent samples are 
obtained. If we use each shower of a sample of M independent showers m times 
to simulate the detectors response, the following sample of size N — M x m 
is obtained, 

xii=yi -Fzii 
xmi = Ym + zmi 

where the notation used henceforth corresponds to where i is the ith 
coordinate of vector ^, a indicates the number of independent shower and 




(1) 



-P(yi . . . yjv, zi . . . zat) =g{yi) ■ ■ 
P(xi...x^) = /(xi).. 



^(yjv) h{zi) . . ./i(zjv), 
/(xiv). 



(2) 
(3) 
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a the number of detector simulation performed using the a-th shower. The 
probabihty of such a configuration is given by 



M m 

P{yi . . . YiV, Zii . . . ZMm) = n dija) H K'^o^a) (4) 

a=l a=l 

M „ m 

P(xii . . . XMm) = Yl dya g{ya) n ^(^"« ~ (^) 

a=l a=l 



2.1 Mean, variance and covariance estimators 



Let us consider the average of the ith coordinate of x, x\ for the realistic case 
in which each shower is used only once to simulate the detector response, 

^^ = ^E<- (6) 

a=l 

By using Eq. (3) it is easy to obtain the very well known expressions for the 
expected value and variance of x\ 



E[x']^E[x'] (7) 
Var[x'] = ^ Var[x% (8) 



The usual estimator of the covariance between two random variables is given 

by, 



1 ^ 



Y E(4-^')(4-^^)- (9) 



For i = j the estimator of the variance of is obtained, sf — Cu. By using 
Eq. (3) it can be shown that both estimators are non-biased. 



E[Cij]=cov[x\x^], (10) 
E[s',]^E[Cu]^Var[x% (11) 



For the case in which each shower is used several times to simulate the response 
of the detectors the average of is given by, 

1 M m 

Mm titi 
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Prom Eqs. (5,12) it can be shown that, 



E[x"] = E[x'], (13) 

1 777 — \ r 

Var[x"] = ^ Var[T'] + j dydxidxs {x\ - E[x']){x\ - E[x']) x 

^(y) Mxi-y) Mx2-y), (14) 

which means that using samples obtained by reusing individual showers to 
simulate the detector response does not introduce any bias when calculating 
the average. However the fluctuations of ,x* are increased by the generation of 
an additional term proportional to (m — 1)/Mm. 

If the response of the detectors and the reconstruction methods do not in- 
troduce any bias on the physical magnitudes y, i.e., / dn h{n) — 0, the 
variance of can be written as. 



1 777 — 1 

where yar[x'] = Farfy*] + yarfz'] is used to obtain the last equation. 

The estimator of the covariance, between a;* and x^ , including multiple repe- 
titions of the individual showers takes the form, 

-1 Mm 
a=l a=l 



The expected value of the covariance estimator is obtained from Eqs. (5) and 
(17), 



111 — \ c 

E[C[j]=cov[x\x^] - J dydx^dx2 {x{ - E[x']){xi - E[x^]) x 

^(y)Mxi-y)Mx2-y). (18) 

Therefore, as expected, the repetition of individual showers introduces a bias 
in the covariance estimator because the events are not independent. The bias 
results proportional to (m — 1)/Mm. 

As mentioned before, the expected value of the variance estimator is obtained 
setting i = J in Eq. (18), 
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E[s"^] = Var[x'] - J ciycZxidx^ {x\ - E[x']){xl - E[x']) g{y) x 

/i(xi-y) Mx2-y), (19) 

which shows that also s'f is now a biased estimator of the variance of x^. 

For the case in which the detectors and reconstruction methods do not intro- 
duce any bias Eqs. (18,19) become, 



E[C'J^cov[x\x^ 



m 



1 



E[s'^^] = Var[x'] - 



Mm 
m — 1 



cov[y\y% 



Mm 



Var[y'']. 



(20) 
(21) 



2. 2 Histogram fluctuations 



The fluctuations in each bin of a histogram are also modified by the repetition 
of individual showers. For simplicity let us consider a one-dimensional his- 
togram of Nb bins, such that a given x belongs to the kth bin if a; e [t^, tk+^t] , 
where At is size of the bin. 

The fluctuations in the content Uk of the kth-hui of a histogram follows a 
binomial distribution. Therefore, the expected value and the variance of Uk 
are given by. 



E[nk] = Npk, (22) 
Var[nk]^Npt,{l-pt,), (23) 

where 

Pk= r^'dxfix), (24) 
Jtk 

with f{x) —go h{x). 

The random variable corresponding to a sample of m repetitions of each 
individual shower can be written as 

M m 

"-fe = 21 [0(^aa - tk) - ©(a^aa " ^fe+l)] , (25) 
a=l a=l 

where 0(a;) = 1 if a; > and Q{x) = otherwise. Written in this way it is 
easy to calculate the expected value and variance of Uk, 
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Fig. 1. Gaussian distributions f{x), g{x) and h{x) used in the example of the artifi- 
cial fluctuations introduced on histograms as a result of reusing individual showers 
to simulate the detector response. f{x) (solid line) is the result of the convolution 
between the other two distribution functions, g{x) and h{x). 



E[ni]=Mm pf,, 
Var[n'i.] = Mm pk{l — Pk) + Mm{m 



(26) 



J dy g{y) (^J^ ^ dxh{x - y) 



dy giv) 



dxh{x — y) 



(27) 



i.e., the mean value does not change and the variance has an extra term that 
increases with m. 

As an example, let us consider that g{x) and h{x) are two Gaussian dis- 
tributions centered at zero with ai = 3/2 and cr2 = 2, respectively, i.e., 
g{x) = G{x; 0, di) and h{x) = G{x; 0, (T2), where 



G{x; /i, cr) 



271 a 



exp 



- ^)^ 
2 (t2 



f28) 



The convolution of two Gaussian distributions is also a Gaussian, therefore, 
in this example f{x) is also a Gaussian centered at zero with = [a^ + cxg]^/^, 
i.e., /(x) = G{x] 0; (Xc). Figure 1 shows the three Gaussian distributions under 
consideration. 



If the bin size of the histogram is sufficiently small, then 



ftk+l 

/ dx A{x) = A{tk)At, 



(29) 
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t 

Fig. 2. Contour plot of o"[ny /cr[n/;] as a function of the lower limit of kth-hin t and 
the number of repetitions m. The number of independent showers and the bin size 
are M = 50 and At = 0.1, respectively. 



is a good approximation for any function A{x) considered in the example. 
Combining this approximation with Eq. (27) and using the Gaussian functions 
/(a;), g{x) and h{x), the following expression for the variance of ra^ is obtained 



Mmim — DAt^ 

Var[n'^] = Mm G{tk] 0, aJAt (1 - G{tk] 0, (xJAt) + \ _ ^ x 



2V? 



1 



1 



- Gitk, 0, Jal + al/2) - - G{tk; 0, ajV2) 

0"2 CTc 



(30) 



Figure 2 shows a contour plot of the ratio (T[ny/(T[nfc] = V arln'^^/"^ /V arlrikY/"^ ^ 
i.e., with (m > 1) and without (m = 1) the inclusion of shower repetitions, as 
a function of m and t, the lower limit of kth bin. The number of independent 
showers is taken as M = 50 and At = 0.1. From the figure it can be seen that 
the larger the number of repetitions the larger the fluctuations compared to 
the case m = 1. 



2.3 Density estimators 



The density estimation technique consist in obtaining an estimator of the 
underlying density function from a given data sample [19,20,21.22]. In one 
of the most widely used variants of that technique, a density estimator is 
obtained from a superposition of kernel functions centered at each event of 
the data sample. For (i-dimensional data the kernel density estimator can be 
written as, 

1 ^ 1 

/(x) = T7 E ^ K{H-'^' ■ (x - x«)), (31) 
a=i y\H\ 

where x is a d-dimensional vector, is a symmetric, positively defined matrix 
(i.e., the symmetric, positively defined square-root matrix H'^^^ exists) and 
i^'(u) is the kernel function. The matrix H gives the covariance between the 
different pairs of variables and also the degree of smoothing, i.e., the width of 
the kernel function. 

From Eqs. (3) and (31) the expected value of the density estimator is obtained, 
^[/(x)] = ^ fd^' K{H-'/' . (x - x')) /(x'), (32) 

which shows that /(x) is a biased estimator of /(x). 

There are several criteria to measure the goodness of the density estimator. 
In particular the mean square error MSE{^) — £'[(/(x) — /(x))^] is a natural 
criterion pointwise. Globally, MSE{yi) can be integrated over x to give the 
integrated mean square error, 

IMSE ^ Jdx MSE{^) = Jdx ^[(/(x) - /(x))^]. (33) 



It is easy to see that MSE{'x) = Var[/(x)] + i?ms^(x), where Var[f{x)] — 
E[/(x)2] - E[/(x)]2 and Btas\^) = (S[/(x)] - /(x))^. Then, 

IMSE ^ Jdx Var{x) + J dx Bias\x). (34) 



By using the Taylor expansion and retaining the dominant terms an approxi- 
mated expression for IMSE is obtained. 



IMSE ^ i / rfx 



du X(u)u^//^/'L>V(x)i/^/'u 



2 1 

+ 



N^ \H 



R{K), (35) 



where 
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. (36) 
R{A)^ j dnA^{n). (37) 

If if = V'^l"^ I h, where /i is a small parameter that parametrizes the degree 
of smoothing, the IMSE is written as, 



IMSE ^'^ jd^ j dn K{n)u^V^/^D^f{x)V^/^u 



+ . (38) 



Minimizing IMSE with respect to /i, the well known expression of hopt is 
recovered, 

where the constant of proportionahty depends on /(x), the unknown density 
function that we want to estimate. There are several methods to estimate the 
smoothing parameter h from the data sample (see section 3). 

Let us consider the case in which shower repetitions of individual showers are 
included. The density estimator in this case is given by, 

-j M m -1 

/'(x) = ^ EE K{H-"' ■ (X - x«„)), (40) 



It can be seen from Eqs. (5) and (40), that the bias does not change when the 
repetitions are introduced. However, as expected, the variance increases. 



yar[/'(x)] ^ R{K) /(x) + ^ x 

Mm J\H\ Mm 



Jdyg{y)h\^-y)-f{^)y (41) 

where just the leading terms are retained. Consequently, the IMSE takes in 
this particular case the form 



IMSE' Jd^ j dn K{u)u^V^/^D^f{yi)V^/^u + 

'J d^dyg{y)h\^-y)-R{f)y{42) 



R{K) ^ m-1 
Mm h'^ \ f\V\ Mm 



Eq. (42) shows that the leading term introduced by the repetitions does not 
depend on h and, therefore, the expression for hgpt remains equal to the m — 1 
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case. The only effect introduced by the repetitions of the individual showers 
is to increase the fluctuations of the estimator for each x. 



3 Numerical Example 

In this section a numerical example that shows the predicted effects introduced 
by the shower repetitions is given. For that purpose, air showers simulations are 
performed using the program CONEX. A total of Ngh = 11000 proton showers 
of primary energy E — 10^^ eV and zenith angle 9 — 30° are generated. 

Samples of the parameter X^ax obtained from the CONEX simulations are 
considered. A Gaussian uncertainty of o-[Xmaa;] = 20 g cm^^ and yU = is 
assumed in order to take into account the detector response and the recon- 
struction method. Therefore, the distribution function of the reconstructed 
Xmax is given by Eq. (1) with g{Xmax) the distribution function correspond- 
ing to the physical fluctuations and h{X) = G{X;0, a[Xrnax]) (see Eq. (28)) 
takes into account the response of the detectors and reconstruction methods. 

Four sets of 100 samples are considered. Each set of samples is noted as 5'(M,m) 
where M indicates the independent values of X^ax (obtained from CONEX) 
in each sample and m the number of repetitions of each shower, i.e., the 
number of times that the Gaussian distribution h{X) = G{X ; X^^^, (T[X,nax]) 
is sampled for each of the M independent values X^^^ in each individual 
sample. Therefore, 5'(iio,i), >S'(io,ii), •S'^j^^g,!) ^^^1 >S'(22,5) are considered, where 
'S'(iio,i) and 5'(^^o i) j^^t differ in the different values obtained from the Gaussian 
distribution performed to include the detector response and reconstruction 
method. The number of events in each sample, belonging to the different sets, 
is Nev = M X m = 110, the same for all kind of samples considered. 

Figure 3 shows the distributions of the estimators of the average, X,nax, and 
the standard deviation, s[Xmaa:]- for the sets of samples considered. It can be 
seen that, as expected, when the repetitions are included, the fluctuations in- 
crease and when the number of independent showers increases the fluctuations 
decrease. Figure 3 also shows that, although the distributions of sf-^^maa:] with 
repetitions have a tail towards larger values of grammage, which is not present 
in the corresponding without repetitions, the bias is not statistically signi- 
ficative. This is consistent with Eq. (21) which shows that the expected bias 
introduced by repetitions in the variance is proportional to (m — 1) /Mm = 0.1 
for ^o.ii)- 

In order to illustrate the effects of repetitions on the density estimators, one- 
dimensional Gaussian kernels are used to estimate the density function of 
^max- An adaptive bandwidth method, introduced by B. Silverman [19], is 
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Fig. 3. Distributions of Xmax and s[Xmax] for the different sets of samples con- 
sidered. Shadowed histograms correspond to samples without multiple repetitions. 



used to obtain better estimates of the density function. The procedure starts 
by performing a first estimation of the density function, from a given sample, 
using a Gaussian kernel with fixed smoothing parameter. 



1 



N 



NV2n a h 



J2 



i=l 



■max ^max) 



2 /i 



(7^ 



(43) 



where N is the size of the sample, a is the standard deviation of the data 
sample and = 1.06 x N~^^^ is the smoothing parameter corresponding to 
Gaussian samples which is used very often in the literature because it gives 
very good estimates even for non Gaussian samples. 



The following parameters are calculated by using the estimate obtained from 
Eq. (43), 



A,: 



k - 
max, 



1/N 



(44) 
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Fig. 4. Mean and one sigma regions for the density estimates obtained from the 
different samples considered. Darker regions and dotted hnes correspond to samples 
including multiples repetitions. 

and then, the final density estimate is obtained from, 

1 ^ 1 

where hi — ho Aj. 



(X 



max -^max) 

2 /i? a2 



(45) 



For each sample belonging to a given set a density estimate is obtained, there- 
fore, 110 density estimates are obtained for each set of samples considered. 
Figure 4 shows the mean value and the one sigma region obtained from the 
density estimates of each set. It can be seen that the mean values correspond- 
ing to samples with or without repetitions are very similar, which is consistent 
with the result obtained in subsection 2.3. Also, as expected from Eq. (41), 
the fluctuations corresponding to sets including repetition are larger and com- 
paring the results obtained for 5'(io,ii) and >S'(22,5) we see that the fluctuations 
in the latter case are smaller due to the smaller number of repetitions. 



4 Conclusions 



In this work we study the effects of recycling individual cosmic ray show- 
ers to simulate the detector response, which is a common practice in Monte 
Carlo simulations at the highest energies. We find that the standard estima- 
tors of the expected value, variance and covariance are modified. In particular, 
the average remains as a non-biased estimator of the expected value but the 
fluctuations are increased. For the standard estimators of the variance and 
covariance a bias proportional to (m — 1)/Mm appears when repetitions are 
included. Besides, as in the case of the average, the fluctuations of both esti- 
mators are increased. We also study the effects of repetitions in histograms. 
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where we find that the mean value of the bin content is unchanged but the fluc- 
tuations are in general larger, depending on the bin considered, and increase 
with the number of repetitions. 

Finally, we study the effects introduced by repetition in the kernel density 
estimators obtained from finite samples. We find again that the expected value 
of the estimator is unchanged, i.e., the bias takes the same form. However, the 
pointwise fluctuations are increased and become more important as the ratio 
(m — 1)/Mm increases. 
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